{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# IPEX LLM Optimizations - Inference for Single Instance\n",
    "\n",
    "This notebook shows you how to enable LLM optimizations using Intel® Extension for PyTorch*. You will be able to run an LLM of your choice without any optimizations and compare the results after applying optimizations. This notebook runs *meta-llama/Meta-Llama-3.1-8B-Instruct*, so you need a Hugging Face token and must first request access to the model. You can do so [here](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct). Refer to the table in the README for the list of supported models."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Install Requirements and Set Up Environment\n",
    "\n",
    "Prerequisite: You must have PyTorch* and Intel® Extension for PyTorch* pre-installed, along with Jupyter Notebook and the ipykernel set up prior to running this notebook. You can follow these installation [instructions](https://github.com/intel/intel-extension-for-pytorch/tree/main/examples/cpu/inference/python/jupyter-notebooks#environment-setup)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install requirements and packages\n",
    "!python -m pip install -r ../requirements.txt\n",
    "!python -m pip install accelerate huggingface-hub\n",
    "\n",
    "# Restart the kernel for changes to take effect, then proceed to the next cell\n",
    "exit()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Library imports\n",
    "import sys\n",
    "import os\n",
    "import pathlib\n",
    "from time import time\n",
    "import numpy as np\n",
    "from itertools import chain\n",
    "import torch\n",
    "import intel_extension_for_pytorch as ipex\n",
    "\n",
    "torch._C._jit_set_texpr_fuser_enabled(False)\n",
    "try:\n",
    "    ipex._C.disable_jit_linear_repack()\n",
    "except Exception:\n",
    "    pass\n",
    "\n",
    "import transformers\n",
    "from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "source": [
    "For this example, we will be running with Meta-Llama-3.1-8B-Instruct, so we will need to log in to HuggingFace using your own token. You can generate your token [here](https://huggingface.co/docs/hub/en/security-tokens)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!huggingface-cli login --token <your HF token>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Select the model to run by specifying its Hugging Face model ID."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_id = \"meta-llama/Meta-Llama-3.1-8B-Instruct\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Running an LLM in BF16 using Advanced Matrix Extensions (AMX) and Intel® Extension for PyTorch*\n",
    "The code below performs inference using AMX BF16 and the LLM optimizations from Intel® Extension for PyTorch*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Datatype\n",
    "dtype = \"bfloat16\"\n",
    "amp_enabled = dtype != \"float32\"\n",
    "amp_dtype = getattr(torch, dtype)\n",
    "\n",
    "# Load model\n",
    "config = AutoConfig.from_pretrained(model_id, torchscript=False, trust_remote_code=True)\n",
    "model = AutoModelForCausalLM.from_pretrained(\n",
    "    model_id,\n",
    "    torch_dtype=amp_dtype,\n",
    "    config=config,\n",
    "    low_cpu_mem_usage=True,\n",
    "    trust_remote_code=True,\n",
    ")\n",
    "model.config.token_latency = True # To print out additional performance metrics\n",
    "tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)\n",
    "model = model.eval()\n",
    "\n",
    "# Customizable hyperparameters\n",
    "batch_size = 1\n",
    "num_beams = 1\n",
    "generate_kwargs = dict(do_sample=False, temperature=0.9, num_beams=num_beams)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following code applies the channels-last (NHWC) memory format and the *ipex.llm.optimize* API to improve performance. The channels-last format performs better for most key CPU operators. *ipex.llm.optimize* optimizes transformer-based models within frontend Python modules by optimizing operators or combining certain operators."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = model.to(memory_format=torch.channels_last)\n",
    "model = ipex.llm.optimize(\n",
    "    model,\n",
    "    dtype=amp_dtype,\n",
    "    inplace=True,\n",
    "    deployment_mode=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Input prompt\n",
    "prompt = input()\n",
    "input_size = tokenizer(prompt, return_tensors=\"pt\").input_ids.size(dim=1)\n",
    "print(\"---- Prompt size:\", input_size)\n",
    "prompt = [prompt] * batch_size"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Run inference with a maximum of 32 new tokens, then repeat with 128. Auto-mixed precision is used here to execute inference in BF16. Because *model.config.token_latency* is enabled, the output of *model.generate* also includes the token latencies. Compare the overall latency numbers to the FP32 results and you will notice a significant performance improvement from using the channels-last format and *ipex.llm.optimize*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def run_inference(model, tokenizer, prompt, max_new_tokens=32, num_warmup=5, num_iter=5):\n",
    "    total_time = 0.0\n",
    "    total_list = []\n",
    "    with torch.no_grad(), torch.inference_mode(), torch.cpu.amp.autocast(\n",
    "        enabled=amp_enabled\n",
    "    ):\n",
    "        input_ids = tokenizer(prompt, return_tensors=\"pt\").input_ids\n",
    "        \n",
    "        # Warm up\n",
    "        print(\"Warm up\")\n",
    "        for i in range(num_warmup):\n",
    "            model.generate(input_ids, max_new_tokens=max_new_tokens, **generate_kwargs)\n",
    "        \n",
    "        # Inference\n",
    "        print(\"Start inference\")\n",
    "        for i in range(num_iter):\n",
    "            tic = time()\n",
    "            output = model.generate(input_ids, max_new_tokens=max_new_tokens, **generate_kwargs)\n",
    "            gen_ids = output[0]\n",
    "            gen_text = tokenizer.batch_decode(gen_ids, skip_special_tokens=True)\n",
    "            toc = time()\n",
    "\n",
    "            input_tokens_lengths = [x.shape[0] for x in input_ids]\n",
    "            output_tokens_lengths = [x.shape[0] for x in gen_ids]\n",
    "            total_new_tokens = [\n",
    "                o - i for i, o in zip(input_tokens_lengths, output_tokens_lengths)\n",
    "            ]\n",
    "            \n",
    "            print(gen_text, total_new_tokens, flush=True)\n",
    "            print(\"Iteration: %d, Time: %.6f sec\" % (i, toc - tic), flush=True)\n",
    "\n",
    "            total_time += toc - tic\n",
    "            total_list.append(output[1])\n",
    "\n",
    "    # Results\n",
    "    print(\"\\n\", \"-\" * 10, \"Summary:\", \"-\" * 10)\n",
    "    latency = total_time / num_iter\n",
    "    print(\"Inference latency: %.5f seconds.\" % latency)\n",
    "\n",
    "    first_latency = np.mean([x[0] for x in total_list]) * 1000\n",
    "    average_2n = list(chain(*[x[1:] for x in total_list]))\n",
    "    average_2n.sort()\n",
    "    average_2n_latency = np.mean(average_2n) * 1000\n",
    "    p90_latency = average_2n[int(len(average_2n) * 0.9)] * 1000\n",
    "    p99_latency = average_2n[int(len(average_2n) * 0.99)] * 1000\n",
    "    print(\"First token average latency: %.2f ms.\" % first_latency)\n",
    "    print(\"Average 2... latency: %.2f ms.\" % average_2n_latency)\n",
    "    print(\"P90 2... latency: %.2f ms.\" % p90_latency)\n",
    "    print(\"P99 2... latency: %.2f ms.\" % p99_latency)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Perform inference in BF16 with 32 max new tokens\n",
    "run_inference(model, tokenizer, prompt, max_new_tokens=32, num_warmup=5, num_iter=10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Perform inference in BF16 with 128 max new tokens\n",
    "run_inference(model, tokenizer, prompt, max_new_tokens=128, num_warmup=5, num_iter=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Static Quantization INT8\n",
    "Intel® Extension for PyTorch* provides APIs for static INT8 quantization, which further reduces the memory footprint of LLMs without sacrificing too much accuracy. These APIs are powered by Intel® Neural Compressor. This example uses SmoothQuant, a post-training quantization technique that tackles the quantization error caused by systematic outliers in activations."
   ]
  },
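  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The idea behind SmoothQuant can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used by Intel® Extension for PyTorch* or Intel® Neural Compressor. It assumes the smoothing rule from the SmoothQuant paper, s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), and the names *smoothing_scales*, *matmul*, and the toy matrices are hypothetical. Dividing each activation channel by its scale and multiplying the matching weight row by the same scale leaves the layer output unchanged while shrinking the activation outliers, which makes the activations easier to quantize."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch only: not the Intel Extension for PyTorch implementation.\n",
    "def smoothing_scales(activations, weights, alpha=0.5):\n",
    "    '''Per-input-channel scale s_j = max|X_j|**alpha / max|W_j|**(1 - alpha).'''\n",
    "    scales = []\n",
    "    for j in range(len(weights)):\n",
    "        act_max = max(abs(row[j]) for row in activations)\n",
    "        w_max = max(abs(w) for w in weights[j])\n",
    "        scales.append(act_max ** alpha / w_max ** (1 - alpha))\n",
    "    return scales\n",
    "\n",
    "\n",
    "def matmul(a, b):\n",
    "    '''Plain-Python matrix product of two lists of rows.'''\n",
    "    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]\n",
    "\n",
    "\n",
    "# Toy data (hypothetical): activation channel 1 carries an outlier.\n",
    "X = [[0.5, 40.0], [-0.25, -32.0]]  # shape (tokens, in_channels)\n",
    "W = [[0.5, -1.0], [0.25, 0.125]]   # shape (in_channels, out_channels)\n",
    "\n",
    "s = smoothing_scales(X, W)\n",
    "X_smooth = [[x / s_j for x, s_j in zip(row, s)] for row in X]  # scale channels down\n",
    "W_smooth = [[w * s_j for w in row] for row, s_j in zip(W, s)]  # scale rows up\n",
    "\n",
    "print('Activation max per channel before:', [max(abs(r[j]) for r in X) for j in range(2)])\n",
    "print('Activation max per channel after: ', [max(abs(r[j]) for r in X_smooth) for j in range(2)])\n",
    "print('Output before smoothing:', matmul(X, W))\n",
    "print('Output after smoothing: ', matmul(X_smooth, W_smooth))"
   ]
  },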
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first step is to generate a qconfig summary file. It can be done using the [Autotune API](https://github.com/intel/intel-extension-for-pytorch/blob/main/docs/tutorials/features/sq_recipe_tuning_api.md). However, due to the time it takes to run Autotune, the example below will just download a qconfig summary file for GPT-J-6B. The full list of available qconfig summary files can be found [here](https://github.com/intel/intel-extension-for-pytorch/tree/main/examples/cpu/llm/inference#2213-static-quantization-int8)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Acquire the qconfig JSON file for GPT-J-6B for quantization\n",
    "try:\n",
    "    print(\"Removing existing qconfig file if present.\")\n",
    "    os.remove(\"gpt-j-6b_qconfig.json\")\n",
    "except FileNotFoundError:\n",
    "    print(\"Note: JSON file does not exist, downloading now.\")\n",
    "!wget https://intel-extension-for-pytorch.s3.amazonaws.com/miscellaneous/llm/cpu/2/gpt-j-6b_qconfig.json"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now run SmoothQuant on the model and run the benchmark. To speed up the process, set the environment variable OMP_NUM_THREADS to the number of physical cores on the CPU; in this example, it is set to 32. The command also uses numactl to bind memory to NUMA node 0 (-m 0) and to specify the cores to run on (-C)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!OMP_NUM_THREADS=32 numactl -m 0 -C all python run.py --benchmark -m EleutherAI/gpt-j-6b --ipex-smooth-quant --qconfig-summary-file gpt-j-6b_qconfig.json --output-dir \"saved_results\" --max-new-tokens 32 --num-warmup 5 --num-iter 10"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Weight-Only Quantization\n",
    "Weight-only quantization (WOQ) quantizes the model's weights but leaves the activations in full precision. This technique reduces the memory footprint and leverages optimized kernels for quantized weights for faster inference, while having minimal impact on accuracy. The two commands below perform INT8 and INT4 WOQ."
   ]
  },
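  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the idea concrete, here is a minimal plain-Python sketch of weight-only quantization. It is an illustration under stated assumptions, not the kernels or API used by ipex.llm: it assumes symmetric INT8 quantization with one scale per output row, and the names *quantize_int8_per_channel*, *linear_woq*, and the toy values are hypothetical. Only the weights are stored as integers; the activation vector stays in floating point and the scale is applied after the integer-weighted sum."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch only: not the ipex.llm weight-only quantization kernels.\n",
    "def quantize_int8_per_channel(weights):\n",
    "    '''Symmetric INT8: one scale per output row, integer values in [-127, 127].'''\n",
    "    q_rows, scales = [], []\n",
    "    for row in weights:\n",
    "        max_abs = max(abs(w) for w in row)\n",
    "        scale = max_abs / 127 if max_abs > 0 else 1.0\n",
    "        q_rows.append([round(w / scale) for w in row])\n",
    "        scales.append(scale)\n",
    "    return q_rows, scales\n",
    "\n",
    "\n",
    "def linear_woq(x, q_rows, scales):\n",
    "    '''y_i = scale_i * sum_j(q_ij * x_j); the activation x stays in floating point.'''\n",
    "    return [s * sum(q * v for q, v in zip(row, x)) for row, s in zip(q_rows, scales)]\n",
    "\n",
    "\n",
    "# Toy weights of shape (out_features, in_features) and one activation vector.\n",
    "W = [[0.02, -0.5, 0.31], [1.2, 0.04, -0.7]]\n",
    "x = [1.0, 2.0, -1.5]\n",
    "\n",
    "q, scales = quantize_int8_per_channel(W)\n",
    "y_ref = [sum(w * v for w, v in zip(row, x)) for row in W]  # full-precision result\n",
    "y_woq = linear_woq(x, q, scales)                           # result with INT8 weights\n",
    "\n",
    "print('Quantized weights:', q)\n",
    "print('Reference output: ', y_ref)\n",
    "print('WOQ output:       ', y_woq)"
   ]
  },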
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Weight-only quantization INT8 with ipex.llm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!OMP_NUM_THREADS=32 numactl -m 0 -C all python run.py --benchmark -m meta-llama/Meta-Llama-3.1-8B-Instruct --ipex-weight-only-quantization --weight-dtype INT8 --quant-with-amp --output-dir \"saved_results\" --max-new-tokens 32 --num-warmup 5 --num-iter 10"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Weight-only quantization INT4 with ipex.llm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!OMP_NUM_THREADS=32 numactl -m 0 -C all python run.py --benchmark -m meta-llama/Meta-Llama-3.1-8B-Instruct --ipex-weight-only-quantization --weight-dtype INT4 --gptq --quant-with-amp --output-dir \"saved_results\" --max-new-tokens 32 --num-warmup 5 --num-iter 10"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Running ipex.llm in a Distributed Manner\n",
    "Running ipex.llm in a distributed manner allows you to utilize all available cores more effectively. This is done using DeepSpeed. It is recommended to shard the model weights for better memory usage when running with DeepSpeed. Sharding only needs to be done once. On subsequent runs, remove \"--shard-model\" and replace \"-m \\<MODEL_ID\\>\" with \"-m \\<sharded model path\\>\"."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Distributed BF16 with ipex.llm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!deepspeed --bind_cores_to_rank run.py --benchmark -m meta-llama/Meta-Llama-3.1-8B-Instruct --dtype bfloat16 --ipex --autotp --shard-model --max-new-tokens 32 --num-warmup 5 --num-iter 10"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Distributed INT8 weight-only quantization with ipex.llm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!deepspeed --bind_cores_to_rank run.py --benchmark -m meta-llama/Meta-Llama-3.1-8B-Instruct --ipex-weight-only-quantization --weight-dtype INT8 --quant-with-amp --autotp --shard-model --output-dir \"saved_results\" --max-new-tokens 32 --num-warmup 5 --num-iter 10"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Conclusion\n",
    "This notebook demonstrates how easy it is to use Intel® Extension for PyTorch* to apply operator optimizations to an LLM for inference. The combination of the channels-last memory format, *ipex.llm.optimize*, and auto-mixed precision (AMP) with BF16 on Xeon processors leads to significant performance improvements for LLM inference. For further performance improvement, we have shown how you can use static INT8 quantization and weight-only quantization in INT8 and INT4. This is applicable to text generation, translation, and summarization models, just to name a few. The strategies mentioned above are for a single-socket CPU only. The next step for more performance improvement is to leverage distributed inference using DeepSpeed."
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
